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ABSTRACT 

In  this  article,  we  discuss  ways  of  using  "dummy  data”  and  mixed  estima¬ 
tion  (Theil  and  Goldberger,  1961)  to  bring  external  information  formally  into 
linear  regression  problems  when  the  experimental  data/model  are  inadequate. 
This  is  a  useful  way  of  attacking  the  same  practical  problems  that  motivated 
the  development  of  ridge  regression  (Hoerl  and  Kennard,  1970). 

The  main  practical  issues  considered  are  (i)  what  form  should  the 
"dummy  data"  take?,  and  (ii)  how  much  weight  should  it  be  given  relative  to 
the  experimental  data?  When  specific  prior  information  is  unavailable,  it  is 
suggested  that  the  dummy  data  should  reflect  a  preference  for  "stable"  re¬ 
sponse  functions  and  it  is  shown  how  this  can  be  accomplished.  Guidelines  for 
the  choice  of  the  weighting  parameter  k  (equivalent  to  the  choice  of  the 
ridge  parameter  in  ridge  regression)  are  given.  Upper  limits  for  k  are 
based  on  various  tests  of  compatibility  between  the  external  (dummy)  data  and 
the  experimental  data.  Lower  limits  for  k  are  determined  by  the  inadequacy 
of  the  data/model  for  the  purpose (s)  of  the  analysis.  Finally,  the  parallel 
Bayesian  approach  is  discussed,  with  emphasis  on  Box's  (1980)  framework  of 
model  estimation  and  criticism. 

AMS(MOS)  Subject  Classification:  62J99 

Key  Words:  Collinearity;  Dummy  data;  External  information  in  regression; 

Ridge  regression. 
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SIGNIFICANCE  AND  EXPLANATION 


We  discuss  ways  of  using  "dummy  data"  to  bring  external  information 
formally  into  linear  regression  problems  when  the  experimental  data/model  are 
inadequate.  The  main  practical  issues  are  (i)  what  form  should  the  "dummy 
data"  take?,  and  (ii)  how  much  weight  should  it  be  given  relative  to  the 
experimental  data?  When  specific  prior  information  is  unavailable,  it  is 
suggested  that  the  dummy  data  should  reflect  a  preference  for  so-called 
"stable"  response  functions,  and  it  is  shown  how  this  can  be  accomplished. 
Guidelines  for  the  choice  of  weighting  are  also  given.  Upper  limits  for  a 
weighting  parameter  are  based  on  various  tests  of  compatibility  between  the 
external  (dummy)  data  and  the  experimental  data.  Lower  limits  are  determined 
by  the  limits  of  inadequacy  of  the  data/model  for  the  purpose (g)  of  the 
analysis.  The  parallel  Bayesian  approach  is  also  discussed. 


The  responsibility  for  the  wording  and  views  expressed  in  this  descriptive 
summary  lies  with  MRC ,  and  not  with  the  authors  of  this  report. 
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USING  EXTERNAL  INFORMATION  IN  LINEAR  REGRESSION,  WITH  A  COMMENTARY  ON  RIOGE  REGRESSION 


I.  MIXED  ESTIMATION  AND  BAYESIAN  METHODS 
Toby  J.  Mitchell*  and  Norman  R.  Draper 


1.  INTRODUCTION. 


In  this  report  we  focus  on  the  problems  that  arise  in  linear  regression  when  there  is 
insufficient  information  in  the  data/model  to  obtain  useful  estimates  of  the  regression 
coefficients  and/or  linear  functions  of  them.  Methods  for  augmenting  the  experimental  data 
with  external  “information"  lead  to  a  form  of  ridge  regression  which  differs  in  several 
major  respects  from  that  which  is  currently  practiced. 

We  shall  consider  models  for  the  observed  response  variable  y^,  given  associated 
"predictor  data"  d,  to  be  of  the  form: 


yd  “  nd  +  V 


(1.1) 


nd  =  £  6  xi(d)  ”  s’(d>&'  0.2) 

i  =  1 

the  p  predictor  variable  ?  are  specified  functions  of  d  ,  the  regression  coef¬ 

ficients  {8.}  are  unknown  constants,  and  e,  is  a  random  variable  with  mean  0  and 

-  i  a 

2  2 
variance  0  .  For  d  *  d*,  e  is  independent  of  e  .  The  error  variance  a  ,  assumed 

d  d 

to  be  constant  for  all  d  ,  and  is  generally  unknown. 

We  assume  that  (1.1)  and  (1.2)  hold  over  a  model  region  R  in  the  space  of  the 

predictor  data.  We  shall  refer  to  (1.2)  as  the  model  function,  although  it  is  really  a 

family  of  model  functions,  indexed  by  the  parameter  vector  (S  .  We  shall  view  the  problem 

of  "estimating"  8  to  be  the  problem  of  choosing  from  this  family  a  single  model  function 

which  is  to  be  used  to  make  inferences  about  n.  given  d  6  R  ,  or  more  generally,  to  make 

d 

inferences  about  specified  Linear  functions  of  the  8's  • 
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Given  a  sample  of  n  independent  observations  (y, ,v2’ *** ,yn^'  where  yu  *s 

associated  with  known  du,  u  *  1,2,  (1.1)  and  (1.2)  imply  the  statistical  model 

for  the  experimental  data: 

E(^)  =  X  g,  (1.3) 

V(,y)  =  °2I.  .  0.4) 

where  the  matrix  jc  is  n  *  p  with  (u,i)th  element  Xui  equal  to  xi<du)  and  £  is  the 
n  *  n  identity  matrix. 

The  most  common  estimation  procedure  follows  from  the  least  squares  criterion:  choose 
J5  to  minimize  the  sum  of  squares:  SS(g)  =  I*  -  J&gl2.  This  yields  the  least  squares 
estimator 

i  *  .  <’•*> 

where  we  assume  that  X'X  is  non-singular  so  that  g  is  unique.  (If  X'X  were  singular, 
(l.b)  would  contain  a  generalized  inverse,  and  g  would  not  be  unique.) 

The  least-squares  estimator  may  be  justified  on  several  statistical  qrounds,  the  most 
prominent  of  which  are: 

(i)  The  Gauss-Markov  Theorem:  Among  all  unbiased  estimators  for  U  =  t'g  that  are 
linear  in  the  y's,  £'g  has  minimum  variance,  no  matter  what  £  is.  (An 
alternative  version  that  does  not  impose  unbiasedness  is:  Among  all  linear 
estimators  tor  li,  t'g  minimizes  the  maximum  mean  squared  error  of  estimation 
(Barnard,  1977).)  Since  the  purpose  of  the  model  is  often  to  estimate  linear 
functions  of  the  B's,  this  is  generally  considered  a  strong  argument  in  favor 
of  least  squares,  although  the  restriction  to  linear  estimators  seems  arbitrary. 

(ii)  If  £  has  a  multinormal  distribution  with  mean  and  variance-covariance  matrix 
given  by  (1.3)  and  (1.4),  then  g  is  the  maximum  likelihood  estimator  of 

g  •  Intuitively,  we  would  expect  the  model  function  that  is  favored  most  by 
the  available  data,  i.e.,  the  maximum  likelihood  model,  to  serve  at  least  as 
well  as  any  other  in  making  inferences  about  general  linear  functions  of  the 
8 's. 
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These  arguments  in  favor  of  least  squares  lose  force  if  the  model  as  given  by  (1,1) 
and  (1.2)  is  inadequate,  or  if,  in  the  case  of  (ii),  %  is  not  normally  distributed. 

Minimum  bias  estimation  (Karson,  Manson,  and  liader,  1969;  Kupper  and  Meydrech,  1974)  and 
robust  regression  (Huber,  1981)  are  alternatives  to  least  squares  in  such  situations.  We 
shall  not  consider  these  methods  here,  since  we  shall  assume  that  the  model  assumptions 
(including  normality,  when  needed)  are  satisfied.  Even  then,  least  squares  estimation  may 
seem  unsatisfactory  because  the  variances  of  the  $'s  may  be  extremely  hiqh  for  certain 
configurations  of  the  d*s.  Various  alternative  estimation  methods  have  been  proposed  for 
these  situations  also;  predominant  among  them  are  variable  selection  and  ridge 
regression.  (See  Hocking  (1976)  toi  an  excellent  overview  of  these  methods.) 

The  motivation  for  our  work  here  is  our  dissatisf action  with  the  way  ridge  regression 
is  perceived  and  used.  There  is  an  enormous  literature  on  this  subject,  stemming  from  the 
basic  papers  of  Hoerl  and  Kennard  (1970a,  1970b).  Unfortunately,  much  of  this  work 
presents  ridge  regression  primarily  as  a  biased  estimation  technique,  based  on  the 
experimental  data  alone,  that  is  "superior"  in  some  sense  to  least  squares  estimation. 

This  approach  has  misled  users  and  authors  alike.  The  main  pronlem  in  situations  where 
ridge  regression  is  often  invoked  is  one  of  lack  of  information  in  the  data  (with  respect 
to  the  choice  of  a  single  model  from  the  given  family),  and  not  with  least  squares 
estimation  per  se.  External  information  or  additional  assumptions  are  needed;  what  form 
snould  these  take  and  how  should  they  be  implemented?  These  are  the  topics  of  Part  I  of 
this  report.  We  have  deliberately  written  Part  1  almost  as  it  ridge  regression  did  not 
exist,  in  order  to  set  out  first  the  fundamentals  that  underlie  our  point  of  view. 

However,  we  do  consider  only  the  kind  of  external  information  that  leads  to  "ridge-type" 
estimators.  In  Part  II,  we  shall  otter  a  commentary  on  some  specific  aspects  of  ridge 
regression,  adding  further  support  to  recent  criticisms  of  various  ridge  regression  myths 
(This ted  1976;  Draper  and  Von  Nostrand,  1979;  .Smith  and  Campbell,  1 980;  Smith,  198U; 

Eger ton  and  haycock,  1981). 


Certainly  the  idea  of  considering  ridge  regression  as  a  means  of  hrinqina  external 
information  into  the  estimation  procedure  is  not  new.  Hoerl  and  Kennard  (lu70a)  mentioned 


a  Bayesian  derivation  of  ridge  estimation,  and  Lindley  and  Smith  (1972)  have  provided  an 
extensive  Bayesian  treatment.  We  shall  rely  more  on  the  "mixed  estimation"  approach  of 
Theil  and  Goldberger  (1961),  in  that  the  external  information  is  introduced  by  means  of 
dummy  data  rather  than  prior  distributions.  We  offer  (in  Section  3)  a  rationale  for 
specifyin  such  data  in  the  absence  of  specific  external  information  about  the  regression 
coefficients.  In  Section  4,  we  suggest  some  guidelines  for  selecting  the  amount  of 
influence,  or  weight,  to  give  the  dummy  data;  this  is  equivalent  to  the  choice  of  the  ridge 
parameter  k  in  ridge  regression.  Bayesian  methods  are  discussed  in  Section  5  with 
emphasis  on  Box's  (1980)  illuminating  approach  to  model  estimation  and  criticism. 
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2 


BRINGING  IN  EXTERNAL  INFORMATION 


2.1.  The  need  Cor  external  information. 

Suppose  we  wish  to  estimate  one  or  more  linear  functions  of  the  regression 

coefficients.  We  shall  denote  these  by  u,u  ,•••,!!  ,  where  u .  =  t ! 6  and  the  t's  are 

1  2  in  i 

known.  If  the  response  vector  %  is  normally  distributed: 

X  ~  (2.1) 

the  set  of  “acceptable"  values  of  3  lies  inside  the  ellipsoid: 

(fi  -  £)'X'X<£  -  £)  <  a  (2.2) 

for  some  suitably  chosen  positive  constant  a  .  This  follows  from  a  standard  livelihood 

argument  which  requires  only  that  every  "acceptable"  g  has  a  likelihood  greater  than 

that  of  every  “unacceptable"  g  .  By  maximizing  and  minimizing  t^g  subject  to  (2.2),  we 

find  that  the  "acceptable"  values  for  must  be  in  the  interval  with  end  points: 

•S-fi  *  <a  eMX'xr’t.)’^.  (2.3) 

2  2  2 

The  choice  a  =  a  X, ,a  In  (2.3)  where  x1-a  is  the  upper  a%  point  of  the  chi-squared 

distribution  with  one  degree  of  freedom,  results  in  the  usual  100(1-a)%  confidence 
2 

interval  for  u^  when  a  is  known.  The  other  factor  that  determines  the  interval  width, 
name ly , 

w.  =  t'(X'X)’'t.  (2.4) 

is  affected  jointly  by  the  model  and  "design",  (i.e.,  the  configuration  of  the  d's),  which 
express  themselves  through  X  . 

If  w ^  in  (2.4)  is  such  that  the  interval  (2.3)  is  too  wide  to  be  of  value,  we 
conclude  that  the  data  and  model  assumptions  alone  are  inadequate  for  the  purpose  of 
estimating  =  t’g  ,  Ideally,  we  would  like  to  have  more  experimental  data,  but  we  shall 
assume  here  that  this  is  not  possible  and  that  our  only  recourse  is  to  bring  in  external, 
often  vague,  information  about  g.  For  another  purpose  (i.e.,  another  t^),  of  course, 
there  might  be  no  need  at  all  for  additional  information  or  assumptions. 

We  have  deliberately  avoided  presenting  wi  as  the  variance  of  t'S  (divided  by  o2) 
to  emphasize  that  its  role  nr,  an  indicator  of  the  information  provided  by  the  data/model 
with  respect  to  is  independent  of  whatever  point  estimation  procedure  for  3  is 
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used.  If  wl  is  "too  wide",  the  problem  cannot  be  alleviated  uy  blaming  least  squares 
estimation  and  seeking  a  "better*  estimator  in  some  class  of  biased  estimators.  External 
information  must  be  used. 


Then  (2.8)  can  be  written 


I  =  (X*X  +  kTl'Nx'x  +  kTJ3*).  (2.13) 

The  variance-covariance  matrix  of  g  is 

V(|)  =  (W  +  W0>_1=  +  kT)-1.  (2.14) 

2 

When  o  is  not  known,  Theil  and  Goldberger  (1961)  and  Theil  (1963)  proposed  to 
2  2 

replace  o  by  s  ,  the  residual  mean  square  from  the  least  squares  analysis  of  the 

experimental  data.  They  also  mentioned  an  iterative  approach  in  which  a  new  estimate  of 
2 

a  and  hence  a  new  weight  matrix  W  is  derived  from  the  residual  mean  square  after  each 
iteration.  Under  this  procedure,  successive  estimates  of  3  converge  to  the  maximum 
likelihood  estimate  of  3  based  on  all  the  data.  Unfortunately,  this  estimate  of  g  is 

not  linear  in  the  observations,  so  further  statistical  results  are  difficult  to  obtain. 

2 

We  shall  adopt  here  a  more  tractable  approach,  and  treat  o as  an  unknown  multiple 
of  02: 


a2  =  k'V.  (2.15) 

For  fixed  k,  this  will  allow  us  to  make  inferences  using  standard  weighted  least  squares 
procedures,  where  the  experimental  data  are  given  weight  1  and  the  dummy  data  are  given 

weight  k.  Thus  g  is  given  by  (2.13)  with  variance-covariance  matrix  given  by  (2.14). 

2 

The  estimate  of  c  is  based  on  the  weighted  residual  mean  square 

a2  =  (vs2  +  q)/(v  +  nQ)  (2.16) 


where 


q 


(£-&  )  •  X' X( X' X  +  kT)-  kT(jj)-g), 


(2.17) 


and  v  =  n-p  is  the  number  of  degrees  of  freedom  for  s 
functions  of  the  3’s  can  be  constructed  in  the  usual 
tion  with  v+nQ  degrees  of  freedom.  Thus,  a  100(1-a)% 
u  =  t’B  is  given  by 

t’l  t  ‘v+n  .a/2l£,(*'*  +  V2  o. 

We  note  that  if  X  has  been  "centered  and  scaled" 


2 

•  Confidence  intervals  for  linear 
way,  using  Student's  t  distribu- 
confidence  interval  for 

(2.18) 

so  that  X* X  is  in  the  form  of  a 


correlation  marix,  and  if  T  =  ^  and  j3  =0,  then  g  reduces  to  the  standard  Hoerl- 
Kennari  (1970a)  ridge  estimator.  Proponents  of  classical  ridge  regression,  while  admitting 


that  ridge  regression  can  ba  viewed  as  the  result  of  using  external  information,  would  deny 
the  formal  use  of  that  information  in  determining  confidence  intervals.  They  would  then 
arrive  at  the  standard  'least  squares'  confidence  intervals  rather  than  (2.18).  (See 
Obenchain,  1977.) 

The  difficulties  in  finding  a  'best*  value  for  k  are  well  known  to  followers  of 
ridge  regression.  Our  approach  here  will  be  to  regard  k  as  fixed  by  assumption,  then 
check  that  assumption  for  compatibility  with  the  experimental  data,  much  in  the  spirit  of 
"model  criticism"  put  forward  by  Box  (1980).  We  shell  discuss  these  procedures  in  Section 
4,  after  we  deal  with  the  problem  of  specifying  J  and  £• 
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3.  USE  OF  PRIOR  PREFERENCES  TO  DETERMINE  T  and  0  . 

3.1.  Prior  preferences. 

We  shall  assume  that  we  can  establish  {by  the  approach  to  be  suggested  in  Section  3.2) 
a  convention  for  ranking  proposed  values  of  3a  priori,  where  the  ranking  is  such  that 
3^ 1  ^  is  preferred  to  0^^  if  and  only  if 


(S( 1 ’-B0) .T°(e(1 >.e°)  <  (g(2)-3°)1TO(£(2)-30) 


(3.1  ) 


for  specified  0^  and  non-negative  definite  T°.  To  ensure  that  the  likelihood  for  the 

*  0  0 

dummy  data  reflects  these  preferences,  we  would  choose  3  =0  and  T  =  T  in  our  mixed 


estimation  procedure. 

A  special  case  of  this  kind  of  rule  is  Hoerl  and  Kennard's  (1970)  preference  for 
"short"  parameter  vectors/  expressed  in  (3.1)  as  3^  =0/  T^  =  where  X'X  is  in 
correlation  form.  This  particular  preference  has  been  criticized  (rightly,  we  believe) 
because  the  length  of  £  depends  on  the  parametrization  of  the  model.  (See  Smith  and 
Campbell  (1980).). 

Here  we  present  a  rationale  for  choosing  T°  and  0^  which  is  in  the  sane  spirit  as 
the  Hoer L-Kennard  preference,  and  at  the  same  time  produces  a  family  of  estimators  that  is 
invariant  under  reparametrizations  of  the  model.  This  rationale,  which  is  based  on  the 
concept  of  "stability”  of  the  expected  response  also  extends  naturally  to  quadratic 

and  higher  order  models.  In  some  circumstances,  of  course,  one's  prior  information  may 
permit  the  choice  of  T°  and  0°  to  be  made  directly. 

3.2.  "Stable  response"  preferences. 

Suppose  Rs  is  a  selected  subregion  of  the  model  region  R  .  We  define  the 

instability  T  of  the  response  function  n,  over  R,.  to  be  the  variance  of  n.  induced 
-  d  ±  d 

by  a  uniform  probability  distribution  <Md)  over  Rg  .  (This  is  just  the  squared  deviation 
of  from  its  average,  integrated  over  Rg  .  We  assume,  of  course,  that  and  the 

functions  x^M)  are  sufficiently  well-behaved  for  T  to  exist.)  We  emphasize  that  V 
is  an  inherent  property  of  the  model  function  ;  it  has  nothing  to  do  with  the  .lata. 

From  (1.2),  T  can  be  written  as  a  quadratic  form: 

&  (3.2) 


where 


U  -  V^(x(d) )  -  E^(x(d)  -  £)(x(d)  -  £)*  -  M  -  ££' 

where 


(3.3) 


M  -  E. (x(d)x'(d)) 
♦  - 

and 


(3.4) 


£  -  E  (x(d) ).  (3.5) 

♦  ~ 

In  the  absence  of  any  other  external  Information  about  £  ,  we  sight  simply  express  a 

preference  for  those  £'s  that  give  a  more  stable  response  t\d  over  Rg,  i.e.  we  choose 

T°  »  u  and  £°  «  0  in  (3.1).  This  leads,  via  mixed  estimation,  to  a  ‘stable  response* 

* 

(SR)  estimator  of  form  (2.13)  with  T  *  U  and  £  ■  0  ,  i.e., 

ISR  ”  (3.6) 

Now  suppose  the  model  function  (1*2)  is  reparametrized  as  follows: 

P 

n  -  I  «i.f .  (d)  -  f  '<d)w  0.7) 

d  i-1  1  1 

where 

f'(d)  -  x ' (d )A,  u  -  A_1£,  (3.8) 

A  being  a  nonsingular  pxp  matrix.  Then  the  appropriate  U-matrix  iss 


U,  -  V  (f (d) )  -  A’UA  ,  (3.9) 

9  ~  ~  ~~ 

the  predictor  data  matrix  is  XA  ,  and  the  SR  estimator  for  w  is: 
wgR  =  (A'X'XA  +  kA'UA) ”'a* X 'y 

-  A-'  (X'X  +  kUj'Vx  (3.10) 

Thus  the  SR  estimator  is  invariant  under  reparameterizations  of  the  model.  This  is  what  we 

would  expect,  since  the  instability  of  the  model  function  depends  only  on  n.  and  R_  , 

d  s 

neither  of  which  is  affected  by  reparameterization. 

It  is  interesting  to  examine  the  influence  of  various  choices  of  the  "region  of 
stability"  Rs  on  U  ,  and  hence  on  £gR.  In  the  three  cases  that  follow  we  suppose 

that  there  is  a  constant  term  in  the  model,  so  it  will  be  convenient  to  change  notation 
slightly  by  writing  the  model  function  (1.2)  as 
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80  +  i  8.x.(d)  =  8q  +  x' !d)B 


(3.11 ) 


i=»1 


and  the  expected  response  vector  (1,3)  as 

El*)  =  4.80  +  (3.12) 

where  is  a  vector  of  1's.  Note  that  p  is  now  1  less  than  the  total  number  of 

coefficients  in  the  model. 

Case  1.  Rg  consists  of  the  data  points  d^dg,  """,d  themselves. 

Since  4(d)  *  1/n  at  every  du  in  the  data  set,  (3.5)  yields 


=  (1,X  ,X  ,•••  ,Xp),  where  \  “  1  *ui  /n 

u= 1 


It  then  follows  from  (3.3)  that 


U  « 


1  •  • 
~  X'x 


(3.13) 


where  X  is  obtained  from  X  by  “centering"  the  columns,  i.e., 

x  =  (i  -  -  i r )x  . 

~  *  n  ~~  ~ 

It  can  be  shown  that  the  corresponding  SR  estimator  (3.6)  is  given  by: 


(3.14) 


®SR 

8„ 


n+k 


-  ?  -  J,  ?i,SR  \ 


(3.15) 


0,SR 

The  effect  of  the  external  data  here  is  to  "shrink"  the  least  squares  estimator 
Case  1.  "Factorial 


Rg  defined  by  the  levels  of  the  observed  predictor  variables  - 


first  order  model. 


We  shall  assume  here  that  n,  is  a  first-order  model  function,  i.e.  the  predictors 

d  - 

xl(d)  in  (3.11)  are  not  functionally  related.  We  now  define  the  region  of  stability  Rs 

to  correspond  to  the  factorial  "design"  generated  by  the  observed  levels  of  the  individual 

predictors.  Let  fl.  =  ( X, .,  X_  X  i.e.,  0.  is  the  set  of  observed  values  of  x( , 

i  li  2i  m  i  1 

and  let  g  be  the  matrix  whose  rows  are  elements  of  the  lattice  x  *  •••  x  •  This 

lattice  has  elements,  some  of  which  may  be  Identical.  For  example. 
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1  3.5 


1  3.5 


He  take  R#  to  be  the  set  of  points  {or}  corresponding  to  the  rows  of  fl  ,  with 

p 

♦(w. )  »  1/n  .  (Note,  however,  that  under  this  definition  R  depends  on  the 
J  8 

parametrization  of  the  laodei ,  so  the  invariance  result  (3.10)  will  not  hold.) 
Under  the  above  conditions,  U  as  defined  by  (3.3)  -  (3.S)  is: 


where  S  is  a  diagonal  natrix  with  ith  diagonal  element 

Si  ’  ^  tXui  '  *i)2  '  i  *  1,2,*,,,P  .  (3.17) 

u 

To  see  this,  note  that  U^+,  is,  by  <3. 3), the  variance  of  Xj^  with  respect  to  the 

distribution  ♦  .  Under  4  ,  the  probability  at  Xu^  is  just  1/n  ,  so  the  mean  of  x^  is 
and  the  variance  of  x^  is  S^/n.  Moreover,  because  of  the  independence  of  x^  and 
x^  for  all  pairs  i  *  j,  the  off-diagonal  terms  of  S  are  0  .  Finally  the  constancy 
of  xQ  -  1  under  +  results  in  the  zero  values  in  the  first  row  and  first  column  of  U  . 
With  U  given  by  (3.16),  the  SR  estimator  (3.6)  is  given  by: 

I  -  (Z'Z  +  ki>'V* 


®0,SR  “  *  "  ®i,SR  *i 

where  Z  is  the  "correlation  form*  of  X  ,  i.e., 

.  - 1/ 

7.  *  X  S  'Z 
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Note  that  (3.18)  is  the  same  as  the  Hoerl-Kennard  family  of  ridge  estimators.  We 
consider  this  derivation  to  be  a  curosity  rather  than  a  rationale  for  using  the  Hoerl- 
Kennard  estimators.  Generally,  we  would  prefer  to  define  Rg  so  that  it  depends  neither 
on  the  observed  data  nor  on  the  pararaetrization  of  the  model. 

Case  3.  Rectangular  Rg  -  first  order  model. 

In  most  regression  problems,  the  predictor  data  d  take  the  form  of  a  vector  of  r 

"basic"  predictor  variables,  and  the  region  of  stability  Rg  can  be  taken  to  be  some 

geometrically  convenient  region  in  R.  In  this  example,  we  again  assume  a  first-order 

model,  so  p  =  r,  and  we  suppose  that  the  predictor  variables  have  been  coded  so  that 

-1  <  x . (d )  <  1  <==>  d  6  R  .  (3.20) 

3  B 

Taking  4>(d)  to  be  uniform  over  Rg  ,  we  find 
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Me  offer  the  'stable  response*  preference  primarily  as  a  route  to  follow  In  the 

absence  of  specific  prior  inforaation  about  individual  B's  or  about  the  expected  response 

n  at  certain  points  d  R.  The  notion  of  instability  of  the  true  response  surface  over 
a 

a  specified  region  seems  to  us  to  lend  itself  to  physical  interpretation  euch  more  readily 
than  the  notion  of  length  of  the  vector  of  regression  coefficients,  even  though  the  choice 
of  region  of  stability  is  still  rather  arbitrary.  The  operational  advantage  of  the  SR 
method  is  that  one  can  apply  it  without  having  to  worry  about  paraaetrisatlon  or  about  the 
specific  meaning  of  individual  coefficients. 

3.3.  Setting  up  the  dummy  data. 

External  preferences  of  the  form  (3.1)  can  be  transformed  into  dummy  data  for  mixed 
estimation  by  choosing  XQ  ,  V^,  and  to  satisfy  (2.13)  and  (2.7)  with  X  m  Xq  aB<* 

JJ*  -  £°.  The  nQ  rows  of  £0  could,  for  example,  be  the  eigenvectors  corresponding  to 
the  positive  eigenvalues  A^  i  -  1,2,*”,n0,  of  T»  V^1  would  then  be  a  diagonal  matrix 
with  those  eigenvalues  on  the  diagonal.  Since  the  analysis  depends  only  on  T  and  £  , 
the  only  point  in  choosing  XQ,  VQ  and  ^  in  this  way  is  to  take  advantage  of  standard 
computer  routines  for  doing  weighted  least  squares.  The  weight  for  the  ith  dummy 
observation  would  be  kA^j  each  experimental  observation  would  have  weight  1. 
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4.  SOME  GUIDELINES  FOR  CHOOSING  k. 

4.1.  Omnibus  compatibility  test. 

—  ~2 

Throughout  this  section,  it  will  be  convenient  to  express  B,  a  ,  and  q,  defined 
respectively  by  (2.13),  (2.16),  and  (2.17),  as  functions  of  k. 

Although  we  doubt  that  a  "best11  procedure  for  choosing  k  exists,  we  can  present  some 
guidelines.  For  example,  we  shall  certainly  not  want  to  choose  k  so  large  that  the 
external  (dummy)  data  is  incompatible  with  the  experimental  data.  Theil's  (1963)  test 

statistic  for  detecting  incompatibility  is,  after  adapting  for  fixed  k  rather  than  fixed 
2 

and  expressing  in  terms  of  £  and  J: 


♦(k)  -  q(k)/nQs  , 


(4.1) 

where  q(k)  is  given  by  (2.17).  Under  the  assumed  model  for  the  experimental  and  dummy 


data,  ^(k)  is  distributed  as  F  where  ng  is  the  rank  of  T.  Large  values  of 

V 

i|i(k)  indicate  incompatibility. 

This  test  for  compatibility  is  equivalent  to  the  usual  test  of  the  hypothesis: 

E(*0>  »  Xgg  against  the  alternative:  E(£0)  *  ~o&*  a  test  frequently  used  in  regression 
to  evaluate  a  subset  of  the  observations  (in  this  case  yQ)  jointly  for  the  presence  of 
outliers.  (Gentleman  and  Wilk,  1975.) 

He  propose  that  <i(k)  be  used  to  set  an  upper  limit  k^  on  k,  chosen  so  that 


i|i(ka)  >  F 


n0.v;a 


(4.2) 


for  some  suitably  chosen  percentage  point  (e.g.  o  =  .10)  of  the  F^  y  distribution. 

Remark  1 .  If  k  <  ka  (i.e.,  if  the  external  information  “passes"  the  compatibility 
test)  then  B(k)  is  within  the  usual  (100(1-a)t  confidence  region  for  £  based  on  the 
experimental  data  alone,  i.e., 

<  P  s2Fp,v;a-  <4‘3) 

That  is,  jJ(Jt)  is  "a-acceptable"  in  the  sense  of  McCabe  (1978).  This  follows  from  the 


fact  that,  for  k  *  k^, 

(lui-gj'radtkl-i)  <  q<*>  4  V2Fn0,v;a  ‘  ps2pp,v;a 

since  it  can  be  shown  that  n„F  <  pF  . 

0  n  ,v,«  c  p,v;a 


(4.4) 
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Remark  2.  The  choice  of  k  associated  with  the  estimator  RIDGM,  which  performed  well 

in  the  large  simulation  study  of  ridge  estimators  conducted  by  Dempster,  Schatzoff  and 

* 

Wermuth  (1977)  is  equivalent  to  the  solution  of  ^(k)  =  1,  where  £  =  0  and  T  -  £.  For 
any  reasonable  a,  therefore,  the  k  for  RIDGM  would  be  less  than  kQ. 

Remark  3,  As  a  first  approximation  to  kQ,  one  could  approximate  ^i(k)  by  a  linear 
function  near  k  •  0>  in  this  neighborhood 

<1<|C>  *  (4.5) 

SO 

ko  *  V2Fn  (4*6) 

0  * 

In  the  case  of  Hoerl-Kennard  ridge  regression,  where  £  >0  and  J  »  Jt,  (4.6)  becomes 

2  *  * 

n„s  F  ..  /B'B.  If  additionally  we  replace  F  by  1,  this  becomes  the  well-known 

o  n^,V|0  ^ 

prescription  for  k  proposed  by  Hoerl,  Kennard,  and  Baldwin  (1975). 

4.2.  Directional  compatibility  tests. 

The  compatibility  of  the  external  data  with  the  experimental  data  can  also  be  tested 
with  respect  to  specific  linear  combinations  «  t' £  of  interest,  where  £’  is  in  the 
row  space  of  XQ  (or,  equivalently,  the  row  space  of  J).  Since 

‘V't  "  ~  N(0.°2i'<(i'lS)~1  +  k”’l')t),  (4.7) 

a  test  statistic  for  compatibility  associated  with  the  direction  t  is: 


*  *  ? 

(t’fi  -  £’&  ) 

♦t^, - -j - — - 2  ' 

£•[(£•*>  +  k  *T  It  s“ 


(4.8) 


which  is  distributed  as  F^  y,  (i.e.,  i|>^  (k)  is  distributed  as  |tv|).  Here  we  have 
used  the  notation  T  to  refer  to  the  Moore-Penrose  generalized  inverse  of  T,  to  cover 
cases  in  which  T  may  be  singular  (ng  <  p). 

4.3.  Maximum  “safe"  k. 

The  direction  in  which  the  incompatibility  between  experimental  and  external  data  is 
most  statistically  significant  can  be  found  by  maximizing  the  numerator  of  ( k )  in  (4.8) 


while  holding  the  denominator  fixed.  This  is  simply  a  matter  of  maximizing  a  non-negative 
definite  quadratic  form  over  a  given  contour  of  a  positive  definite  quadratic  form.  The 


1 


I 


result  is 


4  tic)  =  max  *t()c)  =  s  (max.  eigenvalue  of  £(k)) 


s~2q(k) 


C(k)  =  X'XfX'X  +  kT)“  kT(g.-j|  ) ( 3-j3  >’,  (4.10) 

and  is  independent  of  the  particular  contour  over  which  the  maximization  is  done.  Since 

the  rank  of  C  is  one.  the  maximum  eigenvalue  of  C  is  the  same  as  the  trace,  which  is 

*  * 

q(k).  The  maximizing  direction  t  is  proportional  to 
We  may  now  define  the  maximum  "safe"  k,  k^,  by 

qtk^)  «  s2F1 <U;a  (4.11 ) 

for  some  suitably  chosen  percentage  point  of  the  y  distribution.  For  k  <  k^,  no 
directional  compatibility  test  statistic  will  be  significant  at  level  a. 

4.4.  Lower  bounds  for  k. 

Compatibility  tests  are  useful  for  setting  upper  limits  on  k,  but  not  for  setting 
lower  limits,  since  the  closer  k  is  to  zero,  the  more  compatible  is  the  external  data 
with  the  experimental  data. 

Recall  that  the  reason  we  resort  to  the  use  of  external  data  in  the  first  place  is 
that  the  data/model  for  the  experiment  are  inadequate,  i.e.,  the  confidence  intervals  for 
some  estimates  of  interest  are  too  wide.  This  point  of  view  immediately  establishes  as  a 
lower  bound  for  k  the  smallest  value  that  ensures  that  the  confidence  interval  (2.18)  for 
all  t'g  of  interest  will  be  "sufficiently  narrow".  If  this  is  achieved  at  k  =  0,  we 
would  be  happy  with  the  standard  least  squares  analysis,  since  we  would  not  need  to  use  the 
external  information,  with  all  its  attendant  difficulties. 

There  will  be  some  difficulty  in  practice  if  one  is  unwilling  or  unable  to  specify  the 
maximum  interval  width  "necessary"  for  the  interval  to  be  "useful".  We  would  recommend 
that  the  end  points  of  the  interval  (2,18)  be  plotted  as  a  function  of  k,  or  as  a  function 
of  the  more  meaningful  parameter  P  to  be  defined  in  equation  (4.13).  It  should  then  be 


apparent  how  much  one  gains  (in  the  sense  of  narrowing  the  interval)  and  at  what  cost  (in 


the  sense  of  increasing  the  reliance  on  the  external  information).  This  is  unavoidably  a 
subjective  judgment,  but  one  which  we  feel  is  more  relevant  to  the  choice  of  a  reasonable 
value  of  k  than  is  the  'stability*  of  the  Koerl-Kennard  (1970)  'ridge  trace*,  a  criterion 
frequently  used  by  practitioners  of  ridge  regression. 

Our  approach  will  clearly  lead  to  different  choices  of  k  for  different  end  uses, 
where  each  end  use  is  the  estimation  of  a  linear  combination  of  regression  coefficients. 

For  some  end  uses  it  may  not  be  necessary  to  bring  in  any  external  information  at  all, 
while  for  others,  the  external  information  may  need  to  be  weighted  heavily.  If  this  is  too 
much  trouble,  however,  one  may  choose  an  overall  measure  of  precision  to  plot  against  k 
instead.  A  reasonable  choice  would  be  (k ) ,  the  average  estimated  variance  of  a 

fitted  value  over  the  region  of  interest  in  R.  Using  (2.14)  we  obtain 

V  (k)  -  a2(k)trM(XfX  ♦  kT)*1  (4*12) 

avg  ~  ^  ~  ~ 

where  N  is  the  region  moment  matrix  (E^(g(d)jc' (d) )  for  a  uniform  probability 

~2 

distribution  $  over  the  region  of  interest)  and  o  (k)  is  given  by  (2,16). 

Alternatively,  if  one  were  interested  only  in  changes  in  response  from  one  set  of 
conditions  to  another,  one  would  replace  M  in  (4.12)  by  2({J  -  )  where 

£  »  E^(jx(d ) ) .  Then  (4.12)  would  be  the  average  estimated  variance  of  the  difference  in 
predicted  response  between  two  randomly  chosen  points  in  the  region. 

4.5.  Percent  information  attributed  to  external  data. 

In  communicating  the  results  of  a  regression  analysis  of  the  sort  we  are  considering 
here,  it  is  helpful  to  provide  a  quantitative  measure  of  the  shares  of  information 
attributable  to  the  experimental  and  external  data.  Again  we  rely  on  the  work  of  Theil 
(1963),  who  showed  that  the  function 

P  -  100p”' trace (kT( X  +  kT)”’)  (4.13) 

is  the  only  function  that  meets  a  certain  set  of  reasonable  criteria  for  expressing  the  per 
cent  information  In  J3(k)  attributable  to  the  external  data.  Interchanging  Jl* jC  with 
kT  in  (4.13)  yields  the  percent  information  attributable  to  the  experimental  data.  This 


is  easily  seen  to  equal  100  -  p,  as  it  should.  An  attractive  feature  of  this  definition 
is  that  p  is  invariant  under  nonsingular  linear  reparametrizations  of  the  model. 


In  plotting  the  end  points  of  intervals  of  type  (2.18)  as  recommended  above,  it  may  be 
useful  to  use  p  as  the  horizontal  axis  rather  than  )c  itself.  We  feel  that  p  should 
be  computed  in  any  case,  since  it  is  more  easily  interpretable  than  k,  and  since  its  use 
places  proper  emphasis  on  the  fact  that  £  is  the  direct  result  of  incorporating  external 
information. 


-19- 


5.  BAYESIAN  METHODS 

As  one  might  expect,  there  is  a  Bayesian  parallel  to  the  foregoing  regression  analysis 
for  fixed  k  using  mixed  estimation.  The  evaluation  of  the  choice  of  k  using  tests  of 
compatibility  is  not  a  strictly  Bayesian  concept,  however,  but  is  a  diagnostic  check  in  the 
sense  of  Box  (1980). 

We  shall  review  here  what  we  would  consider  to  be  a  standard  Bayesian  approach,  based 
on  "noninformati ve"  priors,  and  shall  indicate  that  the  modifications  needed  to  make  it 
tractable  and  useful  lead  to  the  analysis  we  have  described  above.  We  shall  utilize 
primarily  the  results  of  Box  and  Tiao  (1973)  and  Box  (1980).  The  reader  is  also  referred 
to  the  paper  of  bindley  and  Smith  (1972)  for  additional  discussion. 

We  assume  the  following  subjective  probability  model 


P<^s2,8*,8.o2.o2,  -  (a2,"{1+n0/2)(a2,-C1^/2>(a2)V/2-1 


exP[--Ug-e  )-T<e-g*>  — 

2O02  20  20* 


2  2  * 

This  may  be  derived  by  assuming  that  in  in  o  ,  and  $  have  independent  locally 


uniform  prior  distributions  and  that 

“  (o0>  °/Z  exp(-  -h 

202 


“  <02)-n/2  exp(-  -L<£-£).rX(£-£)), 


p(32|g.g*g.o2o2)  “  (sV/2*1  exp[ -Vs2/2o2] .  (5.4) 

This  model  is  equivalent  to  that  considered  by  Box  and  Tiao  (1973,  Ch.  9),  for  the 
estimation  of  common  regresison  coefficients  using  data  from  two  independent  sources. 

Direct  adaptation  of  their  results  yields  the  following  posterior  density  for  8: 

P<£iS.s2.g*>  “  (<8-{j)'X'X<e-S)  +  vs2rn/2[(S-£*>VT(£-g*))  q/  .  (5.5) 

Unfortunately,  this  is  an  improper  density  since  it  is  not  inteyrable.  (The  portion  of  the 

* 

integral  in  the  neighborhood  of  3  can  be  made  arbitrarily  large.) 
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1 


(5.6) 


/Vn  alternative  approach  would  be  to  fix  by  assumption,  and  then  use 

P(6,s  ,3  |oq)  as  the  basis  for  a  diagnostic  check  on  0Q,  in  the  spirit  of  Box  (1980). 
The  posterior  distribution  of  j3  is  then 

p<gii'S2,g*,o2)  „  •x*x(£-^)+vS2i-n/2 

X  expt -  (£-8*>'T(B-B*) J. 

2O0 

Although  this  is  a  proper  density  of  relatively  simple  form,  we  do  not  know  of  an  easy  way 
to  obtain  the  posterior  mean  and  variance  of  g.  (One  could  do  it  by  integrating  the 

*  2  *  2  2 

moments  E(j5)  and  E(gg'),  conditional  on  £,s  ,g  ,o  ,oQ,  numerically  with  respect  to 
2  2 

do  /a  . )  Alternatively,  ar\e  could  seek  the  mode  of  (5.6)  for  use  as  an  estimate  of  B. 
This  would  require  an  iterative  procedure  similar  to  that  suggested  by  Lindley  and  Smith 
(1972),  who  pointed  out,  however,  that  this  provides  only  a  point  estimate  of  3.  A 
quadratic  Taylor's  series  expansion  of  the  logarithm  of  (5.6)  about  the  mode  would  lead  to 
an  approximate  variance-covariance  marix.  Even  if  all  this  were  done,  however,  there  are 

further  difficulties  in  obtaining  an  appropriate  test,  in  the  sense  of  Box  (1980),  for  use 

2 

as  a  diagnostic  check  on  the  assumed  dQ. 

AH  things  considered,  therefore,  it  is  not  suprising  that  we  should  want  to  fix  k 
2 

by  assumption,  instead  of  o^,  as  Box  (1980)  did  in  an  illustration  of  his  estimation 
/criticism  analysis.  Box  recommended  that  estimation  be  based  on  the  (multivariate  t) 
posterior  distribution  for  g: 


-  2  .  r  -In 0+n>/2 

P<fil£,s2.£  .*>  «  [l  + - C5 - ] 

<v+no)o 


(5.7) 


—  —2 

where  g  is  given  by  (2.13)  and  <7  is  given  by  (2.16).  In  particular,  posterior 
confidence  intervals  for  any  linear  combination  of  the  regression  coefficients  are  given  by 
(2.18). 

For  diagnostic  checking,  we  note  th<*t 

*2*.  2  v/2-1  2  “(VV)/2 

P(g,s  ,g  |k)  «  (s' )  '  <q(k)  +  vs*)  .  (5.8) 
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If  we  regard  this  as  a  likelihood  function  for  k  ,  then  the  likelihood  ratio  test  of  any 
hypothesized  k  is  based  on  the  statistic  in  (4.1),  i.e.,  it  is  the  same  as  Theil's 

test  for  compatibility  between  experimental  and  external  information.  Box  (1980)  also 
derived  <>(k)  as  one  of  several  diagnostic  checks  based  on  the  distribution  of  the 
experimental  data  conditional  on  all  the  underlying  assumptions,  and  noted  its  equivalence 
to  Theil's  statistic. 
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